Abstract. The sharpened No-Free-Lunch-theorem (NFL-theorem)
states that the performance of all optimization algorithms averaged 
over any finite set F of functions is equal if and only if F is closed 
under permutation (c.u.p.) and each target function in F is equally 
likely. In this paper, we first summarize some consequences of this 
theorem, which have been proven recently: The average number 
of evaluations needed to find a desirable (e.g., optimal) solution 
can be calculated; the number of subsets c.u.p. can be neglected 
compared to the overall number of possible subsets; and problem 
classes relevant in practice are not likely to be c.u.p. Second, as 
the main result, the NFL-theorem is extended. Necessary and sufficient
conditions for NFL-results to hold are given for arbitrary,
non-uniform distributions of target functions. This yields the most 
general NFL-theorem for optimization presented so far. 



1 Introduction 

Search heuristics such as evolutionary algorithms, grid search, simulated
annealing, and tabu search are general in the sense that they can be applied to
any target function $f : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ denotes a finite search space
and $\mathcal{Y}$ is a finite set of totally ordered cost-values. Much research is spent
on developing search heuristics that are superior to others when the target
functions belong to a certain class of problems. But under which conditions can
one search method be better than another? The No-Free-Lunch-theorem for
optimization (NFL-theorem) roughly speaking states that all non-repeating search
algorithms have the same mean performance when averaged uniformly over all
possible objective functions $f : \mathcal{X} \to \mathcal{Y}$ [11)171111611) . Of course, in practice
an algorithm need not perform well on all possible functions, but only on a
subset that arises from the real-world problems at hand, e.g., optimization of
neural networks. Recently, a sharpened version of the NFL-theorem has been
proven that states that NFL-results hold (i.e., the mean performance of all
search algorithms is equal) for any subset $F$ of the set of all possible
functions if and only if $F$ is closed under permutation (c.u.p.) and each
target function in $F$ is equally likely [8].

In this paper, we address the following basic questions: When all algorithms
have the same mean performance, how long does it take on average to find a
desirable solution? How likely is it that a randomly chosen subset of functions
is c.u.p., i.e., fulfills the prerequisites of the sharpened NFL-theorem? Do
constraints relevant in practice lead to classes of target functions that are
c.u.p.? And finally: How can the NFL-theorem be extended to non-uniform
distributions of target functions? Answers to all these questions are given in
the sections 3 to 6. First, the scenario considered in NFL-theorems is described
formally.



2 Preliminaries 
[Figure 1 diagram; recoverable labels: non-repeating black-box search algorithm
$a$; $T_m = \langle (x_1, f(x_1)), \dots, (x_m, f(x_m)) \rangle$]

Fig. 1. Schema of the optimization scenario considered in NFL-theorems. A
non-repeating black-box search algorithm $a$ chooses a new exploration point in
the search space depending on the sequence $T_m$ of the already visited points
with their corresponding cost-values. The target function $f$ returns the
cost-value of a candidate solution as the only information. The performance of
$a$ is determined using the performance measure $c$, which is a function of the
sequence $Y(f, m, a)$ containing the cost-values.



A finite search space $\mathcal{X}$ and a finite set of cost-values $\mathcal{Y}$ are presumed. Let
$\mathcal{F}$ be the set of all objective functions $f : \mathcal{X} \to \mathcal{Y}$ to be optimized (also
called target, fitness, energy, or cost functions). NFL-theorems are concerned
with non-repeating black-box search algorithms (referred to as algorithms) that
choose a new exploration point in the search space depending on the history of
prior explorations: The sequence
$T_m = \langle (x_1, f(x_1)), (x_2, f(x_2)), \dots, (x_m, f(x_m)) \rangle$
represents $m$ pairs of different search points $x_i \in \mathcal{X}$,
$\forall i \neq j : x_i \neq x_j$, and their cost-values $f(x_i) \in \mathcal{Y}$. An
algorithm $a$ appends a pair $(x_{m+1}, f(x_{m+1}))$ to this sequence by mapping
$T_m$ to a new point $x_{m+1}$, $\forall i : x_{m+1} \neq x_i$. In many search
heuristics, such as evolutionary algorithms or simulated annealing in their
canonical form, it is not ensured that a point in the search space is evaluated
only once. However, these algorithms can become non-repeating when they are
coupled with a search-point database, see [3] for an example in the field of
structure optimization of neural networks.

The performance of an algorithm $a$ after $m$ iterations with respect to a
function $f$ depends only on the sequence
$Y(f, m, a) = \langle f(x_1), f(x_2), \dots, f(x_m) \rangle$ of cost-values the
algorithm has produced. Let the function $c$ denote a performance measure
mapping sequences of cost-values to the real numbers. For example, in the case
of function minimization a performance measure that returns the minimum
cost-value in the sequence could be a reasonable choice. See Fig. 1 for a
schema of the scenario assumed in NFL-theorems.
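
The scenario described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run`, `enumerate_in_order`, `adaptive`, and `best_so_far` are hypothetical, the target function is stored as a dictionary, and the adaptive rule is an arbitrary example of an algorithm whose next point depends on the observed cost-values.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Point = Hashable
Trace = List[Tuple[Point, int]]          # T_m: visited points with their cost-values

def run(algorithm: Callable[[Trace, Sequence[Point]], Point],
        f: Dict[Point, int],
        m: int) -> List[int]:
    """Return Y(f, m, a), the cost-values seen in the first m steps."""
    space = list(f)                      # finite search space X
    trace: Trace = []
    for _ in range(m):
        x = algorithm(trace, space)
        if any(x == visited for visited, _ in trace):
            raise ValueError("algorithm revisited a point")   # must be non-repeating
        trace.append((x, f[x]))          # the cost-value is the only feedback
    return [y for _, y in trace]

def enumerate_in_order(trace, space):
    """Ignores the cost-values and visits the points in their listed order."""
    return space[len(trace)]

def adaptive(trace, space):
    """Hypothetical adaptive rule: scan from the left, but jump to the
    rightmost unvisited point once a cost-value of 1 has been observed."""
    visited = {x for x, _ in trace}
    unvisited = [x for x in space if x not in visited]
    seen_one = any(y == 1 for _, y in trace)
    return unvisited[-1] if seen_one else unvisited[0]

def best_so_far(ys):
    """Performance measure c for minimization: the minimum cost-value seen."""
    return min(ys)

f = {"x1": 3, "x2": 1, "x3": 2, "x4": 0}
print(run(enumerate_in_order, f, 3), run(adaptive, f, 3))
```

Both algorithms see the same function, but the sequences $Y(f, m, a)$ they produce differ, and so may the value of $c$.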

Using these definitions, the original NFL-theorem for optimization reads:

Theorem 1 (NFL-theorem [11]). For any two algorithms $a$ and $b$, any
$k \in \mathbb{R}$, any $m \in \{1, \dots, |\mathcal{X}|\}$, and any performance measure $c$

$$\sum_{f \in \mathcal{F}} \delta(k, c(Y(f, m, a))) = \sum_{f \in \mathcal{F}} \delta(k, c(Y(f, m, b))) . \tag{1}$$

Herein, $\delta$ denotes the Kronecker function ($\delta(i,j) = 1$ if $i = j$,
$\delta(i,j) = 0$ otherwise). Proofs can be found in jlUlllHj) . This theorem
implies that for any two (deterministic or stochastic, cf. p^) algorithms $a$
and $b$ and any function $f_a \in \mathcal{F}$, there is a function $f_b \in \mathcal{F}$ on which
$b$ has the same performance as $a$ on $f_a$. Hence, statements like "Averaged
over all functions, my search algorithm is the best" are misconceptions. Note
that the summation in (1) corresponds to uniformly averaging over all functions
in $\mathcal{F}$, i.e., each function has the same probability to be the target
function.

Recently, theorem 1 has been extended to subsets of functions that are closed
under permutation (c.u.p.). Let $\pi : \mathcal{X} \to \mathcal{X}$ be a permutation of $\mathcal{X}$. The
set of all permutations of $\mathcal{X}$ is denoted by $\Pi(\mathcal{X})$. A set $F \subseteq \mathcal{F}$ is
said to be c.u.p. if for any $\pi \in \Pi(\mathcal{X})$ and any function $f \in F$ the
function $f \circ \pi$ is also in $F$.

Example 1. Consider the mappings $\{0,1\}^2 \to \{0,1\}$, denoted by
$f_0, f_1, \dots, f_{15}$ as shown in table 1. Then the set
$\{f_1, f_2, f_4, f_8\}$ is c.u.p., also $\{f_0, f_1, f_2, f_4, f_8\}$. The set
$\{f_1, f_2, f_3, f_4, f_8\}$ is not c.u.p., because some functions are
"missing", e.g., $f_5$, which results from $f_3$ by switching the elements
$(0,1)^T$ and $(1,0)^T$.

In [8] it is proven:

Theorem 2 (sharpened NFL-theorem [8]). For any two algorithms $a$ and $b$, any
$k \in \mathbb{R}$, any $m \in \{1, \dots, |\mathcal{X}|\}$, and any performance measure $c$

$$\sum_{f \in F} \delta(k, c(Y(f, m, a))) = \sum_{f \in F} \delta(k, c(Y(f, m, b))) \tag{2}$$

iff $F$ is c.u.p.
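
Theorem 2 can be checked by brute force on the sets of Example 1. The following is a minimal sketch, not a proof: it assumes that $f_i$ is the truth table whose values over $(0,0), (0,1), (1,0), (1,1)$ are the binary digits of $i$ (the bit order is an assumption, since table 1 is not reproduced here), it uses the largest observed cost-value as a hypothetical performance measure $c$, and it compares only two fixed enumeration orders. All function names are illustrative.

```python
from collections import Counter
from itertools import permutations, product

X = list(product((0, 1), repeat=2))            # search space {0,1}^2

def table(i):
    """f_i as a dict over X; the bit order is an assumption of this sketch."""
    return {x: (i >> (3 - j)) & 1 for j, x in enumerate(X)}

def index(f):
    """Inverse of table(): the index i of a function given as a dict."""
    return sum(f[x] << (3 - j) for j, x in enumerate(X))

def is_cup(indices):
    """F is c.u.p. iff f o pi is in F for every f in F and permutation pi of X."""
    for i in indices:
        f = table(i)
        for perm in permutations(X):
            pi = dict(zip(X, perm))
            if index({x: f[pi[x]] for x in X}) not in indices:
                return False
    return True

def fixed_order(order):
    """Algorithm that enumerates X in a fixed order, ignoring cost-values."""
    return lambda trace: order[len(trace)]

def performance_histogram(indices, algorithm, m, c=max):
    """For each k, the number of f in F with c(Y(f, m, a)) = k."""
    hist = Counter()
    for i in indices:
        f, trace = table(i), []
        for _ in range(m):
            x = algorithm(trace)
            trace.append((x, f[x]))
        hist[c(y for _, y in trace)] += 1
    return hist

a = fixed_order(X)                 # (0,0), (0,1), (1,0), (1,1)
b = fixed_order(X[::-1])           # the reverse order

for F in ({1, 2, 4, 8}, {0, 1, 2, 4, 8}, {1, 2, 3, 4, 8}):
    same = performance_histogram(F, a, 2) == performance_histogram(F, b, 2)
    print(sorted(F), "c.u.p." if is_cup(F) else "not c.u.p.", "equal" if same else "different")
```

For the two c.u.p. sets both algorithms yield identical performance histograms, whereas for $\{f_1, f_2, f_3, f_4, f_8\}$ the histograms differ.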



This is an important extension of theorem 1, because it gives necessary and
sufficient conditions for NFL-results for subsets of functions. But still
theorem 2 can only be applied if all elements in $F$ have the same probability
to be the target function, because the summations average uniformly over $F$.

In the following, the concept of $\mathcal{Y}$-histograms is useful. A $\mathcal{Y}$-histogram
(histogram for short) is a mapping $h : \mathcal{Y} \to \mathbb{N}_0$ such that
$\sum_{y \in \mathcal{Y}} h(y) = |\mathcal{X}|$. The set of all histograms is denoted $\mathcal{H}$. Any
function $f : \mathcal{X} \to \mathcal{Y}$ implies the histogram $h_f(y) = |f^{-1}(y)|$ that counts
the number of elements in $\mathcal{X}$ that are mapped to the same value $y \in \mathcal{Y}$ by
$f$. Herein, $f^{-1}(y)$, $y \in \mathcal{Y}$, returns the preimage $\{x \mid f(x) = y\}$ of
$y$ under $f$. Further, two functions $f$, $g$ are called $h$-equivalent iff
they have the same histogram. The corresponding $h$-equivalence class
$B_h \subseteq \mathcal{F}$ containing all functions with histogram $h$ is termed a basis
class.

Example 2. Consider the functions in table 1. The $\mathcal{Y}$-histogram of $f_1$
contains the value zero three times and the value one one time, i.e., we have
$h_{f_1}(0) = 3$ and $h_{f_1}(1) = 1$. The mappings $f_1, f_2, f_4, f_8$ have the
same $\mathcal{Y}$-histogram and are therefore in the same basis class
$B_{h_{f_1}} = \{f_1, f_2, f_4, f_8\}$. The set $\{f_1, f_2, f_4, f_8, f_{15}\}$
is c.u.p. and corresponds to $B_{h_{f_1}} \cup B_{h_{f_{15}}}$.

It holds: 

Lemma 1 ([5]).

(a) Any subset $F \subseteq \mathcal{F}$ that is c.u.p. is uniquely defined by a union of
pairwise disjoint basis classes.

(b) $B_h$ is equal to the permutation orbit of any function $f$ with histogram
$h$, i.e.,

$$B_h = \bigcup_{\pi \in \Pi(\mathcal{X})} \{ f \circ \pi \} . \tag{3}$$

A proof is given in [5].
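
Lemma 1(b) can be illustrated by enumeration for $\{0,1\}^2 \to \{0,1\}$. The sketch below is an assumption-laden illustration rather than part of the proof: functions are represented as value tuples over a fixed ordering of $\mathcal{X}$, and the helper names `histogram`, `basis_class`, and `orbit` are hypothetical.

```python
from collections import Counter
from itertools import permutations, product

X = list(product((0, 1), repeat=2))                 # search space {0,1}^2
Y = (0, 1)                                          # cost-values
ALL = [dict(zip(X, values)) for values in product(Y, repeat=len(X))]

def histogram(f):
    """h_f(y) = |f^{-1}(y)|, returned as a tuple indexed like Y."""
    counts = Counter(f.values())
    return tuple(counts[y] for y in Y)

def basis_class(h):
    """B_h: all functions (as value tuples over X) with histogram h."""
    return {tuple(f[x] for x in X) for f in ALL if histogram(f) == h}

def orbit(f):
    """Permutation orbit {f o pi}: perm lists pi(x) for each x in X."""
    return {tuple(f[pi_x] for pi_x in perm) for perm in permutations(X)}

# The orbit of every function coincides with its basis class.
for f in ALL:
    assert orbit(f) == basis_class(histogram(f))

print(len({histogram(f) for f in ALL}), "basis classes")
```

The sixteen functions fall into five basis classes, one per histogram, which is the decomposition that Lemma 1(a) relies on.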

3 Time to Find a Desirable Solution 

Theorem 2 tells us that on average all algorithms need the same time to find a
desirable, say optimal, solution, but how long does it take? The average number
of evaluations, i.e., the mean first hitting time $E\{T\}$, needed to find an
optimum depends on the cardinality of the search space $|\mathcal{X}|$ and the number
$n$ of search points that are mapped to a desirable solution.

Table 1. Functions $\{0,1\}^2 \to \{0,1\}$.

Let $F_n \subseteq \mathcal{F}$ be the set of all functions where $n$ elements in $\mathcal{X}$ are
mapped to optimal solutions. For non-repeating black-box search algorithms it
holds:

Theorem 3 ([4]). Given a search space of cardinality $|\mathcal{X}|$ the expected number
of evaluations $E\{T_{|\mathcal{X}|,n}\}$ averaged over $F_n \subseteq \mathcal{F}$ is given by

$$E\{T_{|\mathcal{X}|,n}\} = \frac{|\mathcal{X}| + 1}{n + 1} . \tag{4}$$

A proof can be found in [4], where this result is used to study the influence of
neutrality (i.e., of non-injective genotype-phenotype mappings) on the time to
find a desirable solution.
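
Equation (4) can be checked exactly for small search spaces. The sketch below rests on two assumptions that are not spelled out in the text: every placement of the $n$ optimal points occurs equally often in $F_n$, and, because all algorithms perform equally on average, it suffices to follow a single fixed enumeration of the search space. The function name `mean_first_hitting_time` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def mean_first_hitting_time(size, n):
    """Average, over all placements of n optima among `size` points, of the
    step at which the enumeration 1, 2, ..., size first hits an optimum."""
    placements = list(combinations(range(1, size + 1), n))
    return Fraction(sum(min(p) for p in placements), len(placements))

# Compare the enumeration with the closed form (|X| + 1) / (n + 1).
for size in range(1, 9):
    for n in range(1, size + 1):
        assert mean_first_hitting_time(size, n) == Fraction(size + 1, n + 1)

print(mean_first_hitting_time(10, 1), mean_first_hitting_time(6, 2))
```

With a single optimum among ten points the search takes $11/2$ evaluations on average; with two optima among six points it takes $7/3$.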



4 Fraction of Subsets Closed under Permutation 

The NFL-theorems can be regarded as the basic skeleton of combinatorial
optimization and are important for deriving theoretical results such as the one
presented in the previous section. However, are the preconditions of the
NFL-theorems ever fulfilled in practice? How likely is it that a randomly
chosen subset is c.u.p.?

There exist $2^{(|\mathcal{Y}|^{|\mathcal{X}|})} - 1$ non-empty subsets of $\mathcal{F}$ and it holds:

Theorem 4 ([5]). The number of non-empty subsets of $\mathcal{Y}^{\mathcal{X}}$ that are c.u.p. is
given by

$$2^{\binom{|\mathcal{X}| + |\mathcal{Y}| - 1}{|\mathcal{X}|}} - 1 \tag{5}$$

and therefore the fraction of non-empty subsets c.u.p. is given by

$$\left( 2^{\binom{|\mathcal{X}| + |\mathcal{Y}| - 1}{|\mathcal{X}|}} - 1 \right) \Big/ \left( 2^{(|\mathcal{Y}|^{|\mathcal{X}|})} - 1 \right) . \tag{6}$$

The proof is given in [5].
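
The counting argument behind (5) and (6) can be reproduced numerically: there is one basis class per histogram, there are $\binom{|\mathcal{X}| + |\mathcal{Y}| - 1}{|\mathcal{X}|}$ histograms, and every non-empty union of basis classes is c.u.p. The sketch below compares the closed form with a brute-force enumeration for tiny spaces. The function names are hypothetical, and functions are represented as value tuples over $\{0, \dots, |\mathcal{X}| - 1\}$.

```python
from fractions import Fraction
from itertools import combinations, permutations, product
from math import comb

def num_cup_subsets(x_size, y_size):
    """Closed form (5): one basis class per histogram, any non-empty union."""
    num_histograms = comb(x_size + y_size - 1, x_size)
    return 2 ** num_histograms - 1

def fraction_cup(x_size, y_size):
    """Closed form (6) as an exact fraction."""
    return Fraction(num_cup_subsets(x_size, y_size), 2 ** (y_size ** x_size) - 1)

def brute_force_cup_count(x_size, y_size):
    """Counts c.u.p. subsets by enumeration; feasible only for tiny spaces."""
    functions = list(product(range(y_size), repeat=x_size))   # f as value tuple
    perms = list(permutations(range(x_size)))
    count = 0
    for r in range(1, len(functions) + 1):
        for subset in combinations(functions, r):
            members = set(subset)
            if all(tuple(f[i] for i in perm) in members
                   for f in subset for perm in perms):
                count += 1
    return count

for x_size, y_size in [(2, 2), (3, 2), (2, 3)]:
    assert brute_force_cup_count(x_size, y_size) == num_cup_subsets(x_size, y_size)

print(fraction_cup(2, 2), float(fraction_cup(4, 3)))
```

Already for $|\mathcal{X}| = 4$ and $|\mathcal{Y}| = 3$ the exact fraction has a numerator of $2^{15} - 1$ and a denominator of $2^{81} - 1$.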

Figure 2 shows a plot of the fraction of non-empty subsets c.u.p. versus the
cardinality of $\mathcal{X}$ for different values of $|\mathcal{Y}|$. The fraction decreases for
increasing $|\mathcal{X}|$ as well as for increasing $|\mathcal{Y}|$. More precisely, for
$|\mathcal{Y}| > e|\mathcal{X}| / (|\mathcal{X}| - e)$ it converges to zero double exponentially fast with
increasing $|\mathcal{X}|$. Already for small $|\mathcal{X}|$ and $|\mathcal{Y}|$ the fraction almost
vanishes.

Thus, the statement "I'm only interested in a subset $F$ of all possible
functions, so the precondition of the sharpened NFL-theorems is not fulfilled"
is true with a probability close to one (if $F$ is chosen uniformly and $\mathcal{Y}$
and $\mathcal{X}$ have reasonable cardinalities). The fact that the precondition of the
NFL-theorem is violated does not lead to "Free Lunch", but nevertheless ensures
the possibility of a "Free Appetizer".



[Figure 2 plot; recoverable labels: logarithmic ordinate from $10^{-5}$ down to
$10^{-40}$, abscissa $|\mathcal{X}|$, curves for $|\mathcal{Y}| = 2$, $|\mathcal{Y}| = 3$, and $|\mathcal{Y}| = 4$]

Fig. 2. The ordinate gives the fraction of subsets closed under permutation
on logarithmic scale given the cardinality of the search space $\mathcal{X}$. The
different curves correspond to different cardinalities of the codomain $\mathcal{Y}$.



5 Search Spaces with Neighborhood Relations 

Although the fraction of subsets c.u.p. is close to zero already for small search
and cost-value spaces, the absolute number of subsets c.u.p. grows rapidly with
increasing $|\mathcal{X}|$ and $|\mathcal{Y}|$. What if these classes of functions are the
relevant ones, i.e., those we are dealing with in practice?

Two assumptions can be made for most of the functions relevant in real-world
optimization: First, the search space has some structure. Second, the set of
objective functions fulfills some constraints defined based on this structure.
More formally, there exists a non-trivial neighborhood relation on $\mathcal{X}$ based
on which constraints on the set of functions under consideration are
formulated, e.g., concepts like ruggedness or local optimality and constraints
like upper bounds on the ruggedness or on the maximum number of local minima
can be defined.

A neighborhood relation on $\mathcal{X}$ is a symmetric function
$n : \mathcal{X} \times \mathcal{X} \to \{0, 1\}$. Two elements $x_i, x_j \in \mathcal{X}$ are called neighbors
iff $n(x_i, x_j) = 1$. A neighborhood relation is called non-trivial iff
$\exists x_i, x_j \in \mathcal{X} : x_i \neq x_j \wedge n(x_i, x_j) = 1$ and
$\exists x_k, x_l \in \mathcal{X} : x_k \neq x_l \wedge n(x_k, x_l) = 0$. It holds:

Theorem 5 ([5]). A non-trivial neighborhood relation on $\mathcal{X}$ is not invariant
under permutations of $\mathcal{X}$.

This result is quite general. Assume that the search space $\mathcal{X}$ can be
decomposed as $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_l$, $l > 1$, and let on one
component $\mathcal{X}_i$ exist a non-trivial neighborhood
$n_i : \mathcal{X}_i \times \mathcal{X}_i \to \{0, 1\}$. This neighborhood induces a non-trivial
neighborhood on $\mathcal{X}$, where two points are neighbored iff their $i$-th
components are neighbored with respect to $n_i$. Thus, the constraints discussed
below need only refer to a single component. Note that the neighborhood
relation need not be the canonical one (e.g., Hamming-distance for Boolean
search spaces). For example, if integers are encoded by bit-strings, then the
bit-strings can be defined as neighbored iff the corresponding integers are.
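
Theorem 5 can be illustrated on a four-point search space. The sketch assumes that "invariant under permutations" means $n(\pi(x_i), \pi(x_j)) = n(x_i, x_j)$ for every permutation $\pi$ and all distinct $x_i, x_j$; this reading, the helper `is_invariant`, and the two example relations are assumptions made for illustration.

```python
from itertools import permutations

def is_invariant(points, n):
    """True iff n(pi(x), pi(y)) == n(x, y) for all permutations pi and x != y."""
    for perm in permutations(points):
        pi = dict(zip(points, perm))
        for x in points:
            for y in points:
                if x != y and n(pi[x], pi[y]) != n(x, y):
                    return False
    return True

def line(x, y):
    """Non-trivial relation: integers are neighbors iff they differ by one."""
    return int(abs(x - y) == 1)

def complete(x, y):
    """Trivial relation: every pair of distinct points is neighbored."""
    return int(x != y)

points = [0, 1, 2, 3]
print(is_invariant(points, line), is_invariant(points, complete))
```

Only the trivial relation survives every relabeling of the points; the line structure is destroyed, for instance, by swapping the points 1 and 3.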

Some constraints that are defined with respect to a neighborhood relation
and that are relevant in practice are now discussed, cf. [5]. For this purpose, a
metric $d_{\mathcal{Y}} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ on $\mathcal{Y}$ is presumed, e.g., in the typical
case of real-valued target functions $\mathcal{Y} \subset \mathbb{R}$ the Euclidean distance.

A constraint on steepness leads to a set of functions that is not c.u.p. Based
on a neighborhood relation on the search space, we can define a simple measure of
maximum steepness of a function $f \in \mathcal{F}$ by the maximum distance of the target
values of neighbored points
$s^{\max}(f) = \max_{x_i, x_j \in \mathcal{X} \,\wedge\, n(x_i, x_j) = 1} d_{\mathcal{Y}}(f(x_i), f(x_j))$.
Further, for a function $f \in \mathcal{F}$, the diameter of its range can be defined as
$d^{\max}(f) = \max_{x_i, x_j \in \mathcal{X}} d_{\mathcal{Y}}(f(x_i), f(x_j))$.

Corollary 1 ([5]). If the maximum steepness $s^{\max}(f)$ of every function $f$
in a non-empty subset $F \subseteq \mathcal{F}$ is constrained to be smaller than the maximal
possible $\max_{f \in F} d^{\max}(f)$, then $F$ is not c.u.p.

Consider the number of local minima, which is often regarded as a measure
of complexity [§]. For a function $f \in \mathcal{F}$ a point $x \in \mathcal{X}$ is a local
minimum iff $f(x) < f(x_i)$ for all neighbors $x_i$ of $x$. Given a function $f$
and a neighborhood relation on $\mathcal{X}$, let $l^{\max}(f)$ be the maximal number of
minima that functions with the same $\mathcal{Y}$-histogram as $f$ can have (i.e.,
functions where the number of $\mathcal{X}$-values that are mapped to a certain
$\mathcal{Y}$-value are the same as for $f$).

Corollary 2 ([5]). If the number of local minima of every function $f$ in a
non-empty subset $F \subseteq \mathcal{F}$ is constrained to be smaller than the maximal
possible $\max_{f \in F} l^{\max}(f)$, then $F$ is not c.u.p.

Example 3. Consider all mappings $\{0,1\}^{\ell} \to \{0,1\}$ that have less than
the maximum number of $2^{\ell - 1}$ local minima w.r.t. the ordinary hypercube
topology on $\{0,1\}^{\ell}$. This means, this set does not contain mappings such
as the parity function, which is one iff the number of ones in the input
bitstring is even. This set is not c.u.p.
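
Example 3 can be made concrete for $\ell = 3$. The sketch below counts strict local minima on the hypercube and compares the parity function with a second function of the same histogram. The function `first_bit` is a hypothetical choice made for illustration, and all helper names are assumptions.

```python
from itertools import product

def neighbors(x):
    """Hamming-distance-1 neighbors of x in the hypercube."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

def num_local_minima(f, l):
    """Number of points x with f(x) < f(x_i) for all neighbors x_i of x."""
    return sum(all(f(x) < f(nb) for nb in neighbors(x))
               for x in product((0, 1), repeat=l))

def parity(x):
    """One iff the number of ones in x is even."""
    return 1 if sum(x) % 2 == 0 else 0

def first_bit(x):
    """Hypothetical comparison function: returns the first input bit."""
    return x[0]

l = 3
points = list(product((0, 1), repeat=l))
same_histogram = sorted(map(parity, points)) == sorted(map(first_bit, points))
print(num_local_minima(parity, l), num_local_minima(first_bit, l), same_histogram)
```

Both functions map half of the points to zero, so they lie in the same basis class and one results from the other by a permutation of the search space. Since `first_bit` has no strict local minimum while parity attains the maximum $2^{\ell - 1}$, a set that contains the former and excludes the latter cannot be c.u.p.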

Hence, statements like "In my application domain, functions with maximum
number of local minima are not realistic" and "For some components, the
objective functions under consideration will not have the maximal possible
steepness" lead to scenarios where the precondition of the NFL-theorem is not
fulfilled.

6 A Non-Uniform NFL-theorem 

In the sharpened NFL-theorem it is implicitly presumed that all functions in
the subset $F$ are equally likely since averaging is done by uniform summation
over $F$. Here, we investigate the general case when every function $f \in \mathcal{F}$
has an arbitrary probability $p(f)$ to be the objective function. Such a
non-uniform distribution of the functions in $\mathcal{F}$ appears to be much more
realistic. Until now, there exist only very weak results for this general
scenario. For example, let for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$

$$p_x(y) := \sum_{f \in \mathcal{F}} p(f) \, \delta(f(x), y) , \tag{7}$$

i.e., $p_x(y)$ denotes the probability that the search point $x$ is mapped to
the cost-value $y$. In [2] it has been shown that an NFL-result holds if within
a class of functions the function values are i.i.d., i.e., if

$$\forall x_1, x_2 \in \mathcal{X} : \; p_{x_1} = p_{x_2} \;\text{ and }\; p_{x_1, x_2} = p_{x_1} \, p_{x_2} , \tag{8}$$

where $p_{x_1, x_2}$ is the joint probability distribution of the function
values of the search points $x_1$ and $x_2$. However, this is not a necessary
condition and applies only to extremely "unstructured" problem classes.

The following theorem gives a necessary and sufficient condition for an
NFL-result in the general case of non-uniform distributions:

Theorem 6 (non-uniform sharpened NFL). For any two algorithms $a$ and $b$, any
value $k \in \mathbb{R}$, and any performance measure $c$

$$\sum_{f \in \mathcal{F}} p(f) \, \delta(k, c(Y(f, m, a))) = \sum_{f \in \mathcal{F}} p(f) \, \delta(k, c(Y(f, m, b))) \tag{9}$$

iff for all $h$

$$f, g \in B_h \;\Rightarrow\; p(f) = p(g) . \tag{10}$$

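Proof. First, we show that (10) implies that (9) holds for any $a$, $b$, $k$,
and $c$. It holds by lemma 1(a)

$$\sum_{f \in \mathcal{F}} p(f) \, \delta(k, c(Y(f, m, a))) = \sum_{h \in \mathcal{H}} \sum_{f \in B_h} p(f) \, \delta(k, c(Y(f, m, a))) \tag{11}$$

using $f, g \in B_h \Rightarrow p(f) = p(g) = p_h$

$$= \sum_{h \in \mathcal{H}} p_h \sum_{f \in B_h} \delta(k, c(Y(f, m, a))) \tag{12}$$

as each $B_h$ is c.u.p. we may use theorem 2

$$= \sum_{h \in \mathcal{H}} p_h \sum_{f \in B_h} \delta(k, c(Y(f, m, b))) \tag{13}$$

$$= \sum_{f \in \mathcal{F}} p(f) \, \delta(k, c(Y(f, m, b))) . \tag{14}$$

Now we prove that (9) being true for any $a$, $b$, $c$, and $k$ implies (10) by
showing that if (10) is not fulfilled then there exist $a$, $b$, $c$, and $k$
such that (9) is also not valid. Let $f, g \in B_h$, $f \neq g$,
$p(f) \neq p(g)$, and $g = f \circ \pi$; such a permutation $\pi$ exists by
lemma 1(b). Let $\mathcal{X} = \{\xi_1, \dots, \xi_n\}$. Let $a$ be an algorithm that
always enumerates the search space in the order $\xi_1, \dots, \xi_n$ regardless
of the observed cost-values and let $b$ be an algorithm that enumerates the
search space always in the order $\pi^{-1}(\xi_1), \dots, \pi^{-1}(\xi_n)$. It
holds $g(\pi^{-1}(\xi_i)) = f(\xi_i)$ for $i = 1, \dots, n$ and
$Y(f, n, a) = Y(g, n, b)$. We consider the performance measure

$$c^{\ast}(\langle y_1, \dots, y_m \rangle) = \begin{cases} 1 & \text{if } m = n \wedge \langle y_1, \dots, y_m \rangle = \langle f(\xi_1), \dots, f(\xi_n) \rangle , \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

for any $y_1, \dots, y_m \in \mathcal{Y}$. Then, for $m = n$ and $k = 1$, we have

$$\sum_{f' \in \mathcal{F}} p(f') \, \delta(k, c^{\ast}(Y(f', n, a))) = p(f) , \tag{16}$$

as $f' = f$ is the only function $f' \in \mathcal{F}$ that yields

$$\langle f'(\xi_1), \dots, f'(\xi_n) \rangle = \langle f(\xi_1), \dots, f(\xi_n) \rangle , \tag{17}$$

and

$$\sum_{f' \in \mathcal{F}} p(f') \, \delta(k, c^{\ast}(Y(f', n, b))) = p(g) , \tag{18}$$

and therefore (9) does not hold. □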
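
Both directions of theorem 6 can be illustrated numerically on $\{0,1\}^2 \to \{0,1\}$. The following sketch is a sanity check under simplifying assumptions, not a substitute for the proof: it considers only algorithms that enumerate the search space in a fixed order, it uses the largest observed cost-value as a hypothetical performance measure, and the two distributions `p_ok` and `p_bad` are arbitrary examples.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations, product

X = list(product((0, 1), repeat=2))             # search space {0,1}^2
Y = (0, 1)
ALL = list(product(Y, repeat=len(X)))           # f as a tuple of values over X

def histogram(f):
    """h_f as a tuple indexed like Y."""
    return tuple(f.count(y) for y in Y)

def normalize(weights):
    total = sum(weights.values())
    return {f: Fraction(w, total) for f, w in weights.items()}

def performance_distribution(p, order, m, c=max):
    """P(c(Y(f, m, a)) = k) for an algorithm a that enumerates X in `order`."""
    dist = defaultdict(Fraction)
    for f, prob in p.items():
        ys = [f[X.index(x)] for x in order[:m]]
        dist[c(ys)] += prob
    return dict(dist)

def all_orders_agree(p):
    """True iff every fixed enumeration order yields the same distributions."""
    for m in range(1, len(X) + 1):
        dists = [performance_distribution(p, order, m) for order in permutations(X)]
        if any(d != dists[0] for d in dists):
            return False
    return True

# p_ok depends on f only through its histogram, so condition (10) holds.
weights = {f: 1 + histogram(f)[1] for f in ALL}
p_ok = normalize(weights)

# p_bad shifts weight onto one member of a basis class, violating (10).
weights_bad = dict(weights)
weights_bad[(0, 0, 0, 1)] += 1
p_bad = normalize(weights_bad)

print(all_orders_agree(p_ok), all_orders_agree(p_bad))
```

Under `p_ok` all enumeration orders perform identically, whereas under `p_bad` an order that visits the point $(1,1)$ first is more likely to observe the cost-value one.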

The sufficient condition given in [5] is a special case of theorem 6, because
(8) implies

$$g = f \circ \pi \;\Rightarrow\; p(f) = p(g) \tag{19}$$

for any $f, g \in \mathcal{F}$ and $\pi \in \Pi(\mathcal{X})$, which in turn implies
$g, f \in B_h \Rightarrow p(f) = p(g)$ due to lemma 1(b).

The probability that a randomly chosen distribution over the set of objective
functions fulfills the preconditions of theorem 6 has measure zero. This means
that in this general and realistic scenario the probability that the conditions
for an NFL-result hold vanishes.



7 Conclusion 

Several recent results on NFL-theorems for optimization presented in [5,4] were
summarized and extended. In particular, we derived necessary and sufficient
conditions for NFL-results for arbitrary distributions of target functions and
thereby presented the "sharpest" NFL theorem so far. It turns out that in this
generalized scenario, the necessary conditions for NFL-results cannot be
expected to be fulfilled.